DESCRIPTION

19. Nov. 2002

A method of clustering a set of records 



Field of the invention 

The present invention relates to the field of data processing,
and more particularly, without limitation, to the field of data
clustering.

Background and prior art 

Clustering of data is a data processing task in which clusters
are identified in a structured set of raw data. Typically the
raw data consists of a large set of records, each record having
the same or a similar format. Each field in a record can take
any of a number of logical, categorical, or numerical values.
Data clustering aims to group such records into clusters such
that records belonging to the same cluster have a high degree of
similarity.

A variety of algorithms are known for data clustering. The
K-means algorithm relies on the minimal sum of Euclidean distances
to the centers of clusters, taking into consideration the number of
clusters. The Kohonen algorithm is based on a neural net and
also uses Euclidean distances. IBM's demographic algorithm
relies on the sum of internal similarities minus the sum of
external similarities as a clustering criterion. These and other
clustering criteria are utilized in an iterative process of
finding clusters.

A common disadvantage of such prior art clustering methods is
that they are computationally expensive and require a lot of
computing power. This is especially true for very large data
sets.

It is therefore an object of the present invention to provide an
improved method of clustering which requires less computing
power. It is a further object of the present invention to
provide a corresponding computer program product and data
processing system.

Summary of the invention 

The underlying problem of the invention is solved basically by
applying the features laid down in the respective independent
claims. Preferred embodiments of the invention are given in the
dependent claims.

In essence the present invention provides a computationally
inexpensive method for accurately clustering data records
containing structured raw data. Each of the data records
contains a sequence of attribute values of corresponding
attributes. For each of the attributes of the structured set of
raw data contained in the records a characteristic value is
calculated by evaluating the attribute values of that attribute
across the data records. For each of the attribute values a
deviation from the corresponding characteristic value is
calculated. Next the attributes of each record are sorted based
on the deviations to provide a sequence of attributes which is
then used as a key for clustering.

In accordance with a preferred embodiment of the invention the 
mean value or the median value of the attribute values of a 
certain attribute across the data records is calculated to 
provide the characteristic value. 

In accordance with a further preferred embodiment of the
invention the deviation of an attribute value is calculated by
determining the difference between the attribute value and the
corresponding characteristic value. The difference should then
be normalized, preferably by dividing by that characteristic
value.

In accordance with a further preferred embodiment of the
invention the attributes of a record are sorted using the
corresponding deviations for the evaluation of a sorting
criterion. For example the attributes with their corresponding
deviations are sorted in ascending or descending order.
Preferably the same sorting criterion is applied for all
considered records.

In accordance with a further preferred embodiment of the
invention the clustering is performed based on the keys provided
by sorting the attributes of the records. A user may select one
of a given number of criteria for evaluation of the
keys for clustering of the data records. For example all data
records which have the same first m attributes are put into the
same cluster, with or without considering the sign of the
deviations.

In accordance with a further preferred embodiment of the
invention, the clustering result is refined by searching for best
matching keys in other clusters for the records of the smallest
cluster. This way the records contained in the smallest cluster
are distributed to other clusters such that the total number of
clusters is reduced. To identify another cluster for a
record in the smallest cluster, a distance measure such as the
Euclidean distance can be utilised.

In addition or as an alternative to the Euclidean distance, gravitation
can be used for reducing the number of clusters
(http://www.ticam.utexas.edu/~zeyun/pick.htm).

The present invention is particularly advantageous in that it
provides an efficient and computationally inexpensive way to
analyse the characteristics of unknown data. It is a further
particular advantage of the invention that the clustering
method needs only two passes over the data.

In the following, preferred embodiments of the invention are
described in greater detail by making reference to the drawings
in which:

Figure 1 is illustrative of a flow chart of a preferred 
embodiment of a method of the invention, 

Figure 2 is a block diagram of a preferred embodiment of a 
computer system of the invention. 

Detailed description 

Figure 1 shows a flow chart for performing a method of
clustering of data records containing structured raw data.
Given are $n$ records $r_1, \dots, r_n$ with $k$ numeric attributes $a_1, \dots, a_k$,
where $a_i(r_j)$ is the value of the $i$-th attribute of the $j$-th
record. In step 100 a characteristic value is calculated for
each of the attributes. For a given attribute this is done by
calculating a projection of the attribute values of this
attribute across the records.

For example, the mean value is calculated as a characteristic
value for each one of the attributes: for each attribute $a_i$,
$i = 1, \dots, k$, calculate the mean value $\mu$ over all records

$$\mu(a_i) = \frac{1}{n} \sum_{j=1}^{n} a_i(r_j) \qquad (1)$$

Instead of the mean values, the median values can be calculated.
The median value is calculated by determining the difference
between a maximum attribute value of a considered attribute and
a minimum attribute value of the considered attribute over all
records divided by two. Alternatively any other equivalent of a
mean or median value can be calculated instead. By means of
such mean values, median values or equivalent values,
characteristic values are provided for each one of the
attributes.
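Step 100 can be sketched in Python as follows. This is a minimal illustration rather than the claimed implementation: the function name `characteristic_values`, the `kind` parameter and the sample records are assumptions, and the `"median"` branch uses the standard statistical median from Python's `statistics` module, which may differ from the range-based computation described above.

```python
from statistics import mean, median

def characteristic_values(records, kind="mean"):
    """Return one characteristic value per attribute (column) of `records`.

    `records` is a list of equally long lists of numbers; the name and the
    `kind` switch are illustrative assumptions, not taken from the text.
    """
    if not records:
        raise ValueError("at least one record is required")
    aggregate = mean if kind == "mean" else median
    # zip(*records) turns the rows (records) into columns (attributes).
    return [aggregate(column) for column in zip(*records)]

# Three hypothetical records with three numeric attributes each.
records = [
    [2.0, 10.0, 1.0],
    [4.0, 30.0, 1.0],
    [6.0, 20.0, 4.0],
]
mu = characteristic_values(records)                # mean per attribute, equation (1)
med = characteristic_values(records, "median")     # standard median per attribute
```

Either list can then serve as the characteristic values used in the following steps.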

In step 102 the deviations of each attribute value of a
considered record from the corresponding characteristic value
are determined. For example the deviation of an attribute value
from its characteristic value can be determined by calculating
the difference between the attribute value and its
characteristic value. Preferably the difference is divided by
the characteristic value.

In step 104 the deviations which have been obtained for each of
the records are used as a basis for sorting the attributes of
that record. For example the attributes are sorted in ascending
or descending order of the deviations. This way a key
comprising an ordered list of attributes and associated
deviations is provided for each one of the records.

Preferably the steps 102 and 104 are carried out as follows: 

1. Consider record $r_i$.

2. Consider attribute $a_j$.

3. Calculate the deviation $\hat{a}_j(r_i)$ of $a_j(r_i)$ from the
respective mean of attribute $a_j$. This can be done by, but
is not limited to

$$\hat{a}_j(r_i) = \frac{a_j(r_i) - \mu(a_j)}{\mu(a_j)} \qquad (2)$$

or any other deviation formula.

4. Repeat this for all attributes $a_1, \dots, a_k$ of the record $r_i$
by applying steps 2 and 3.

5. Rank the deviations $|\hat{a}_1(r_i)|, \dots, |\hat{a}_k(r_i)|$ from the largest to
the smallest, yielding $\hat{a}_{l_1}(r_i), \dots, \hat{a}_{l_k}(r_i)$. This ranking shows
which attributes deviate the most from the mean of all
records. For example, since $\hat{a}_{l_1}(r_i)$ has the largest
deviation from the respective mean value, this means
that record $r_i$ is most significantly set apart from all
other records by attribute $a_{l_1}$. The largest value shows
the biggest deviation from the rest of the data, and hence
that attribute is very characteristic.
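The five steps above can be sketched for a single record as follows. The function names `deviations` and `make_key`, the representation of a key as a list of `(attribute number, deviation)` pairs and the sample values are assumptions made for illustration. The sketch also assumes that every characteristic value is non-zero and that ties keep the original attribute order; the text does not address either case.

```python
def deviations(record, mu):
    """Normalised deviation of each attribute value from its characteristic value.

    Difference divided by the characteristic value; assumes no entry of
    `mu` is zero.
    """
    return [(value - m) / m for value, m in zip(record, mu)]

def make_key(record, mu):
    """Return [(attribute_number, deviation), ...] ranked by |deviation|, largest first."""
    devs = deviations(record, mu)
    # Python's sort is stable, so attributes with equal |deviation| keep their order.
    order = sorted(range(len(devs)), key=lambda i: abs(devs[i]), reverse=True)
    # Attributes are numbered from 1 to match a_1, ..., a_k in the text.
    return [(i + 1, devs[i]) for i in order]

mu = [4.0, 20.0, 2.0]            # hypothetical characteristic values
record = [3.0, 30.0, 4.0]        # hypothetical record
key = make_key(record, mu)
```

For this record the third attribute deviates most from its mean, so it comes first in the key, followed by the second and the first attribute.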

In step 106 the records are clustered based on the keys.

One approach for performing the clustering based on the keys is
to put records having identical keys into the same cluster.
However, this may result in too large a number of clusters.

It is therefore preferred to define a similarity criterion. 
When the keys of two records fulfil the similarity criterion the 
records are put into the same cluster. 

Let $\hat{a}_{l_1}(r_i), \dots, \hat{a}_{l_k}(r_i)$ be the ranking, i.e. the key, of record $r_i$ and
$\hat{a}_{l_1}(r_j), \dots, \hat{a}_{l_k}(r_j)$ be the ranking of record $r_j$.

Some examples for preferred criteria are given in the following: 

Criterion A: $r_i$ and $r_j$ belong to the same cluster if the first m
attributes of the respective keys are identical and share the
same sign. For example, if the three most significant
attributes (m = 3) are considered, the ranking of record $r_i$
is

$(\hat{a}_7(r_i), \hat{a}_2(r_i), \hat{a}_3(r_i), \hat{a}_9(r_i), \dots) = (-1.17, 0.95, 0.87, 0.56, \dots)$

and the ranking of $r_j$ is

$(\hat{a}_7(r_j), \hat{a}_2(r_j), \hat{a}_3(r_j), \hat{a}_1(r_j), \dots) = (-1.46, 1.09, 0.89, 0.88, \dots)$.

The records $r_i$ and $r_j$ belong to the same cluster, as the first
three attributes of the keys are identical as well as the signs
of the values.

But if the ranking of $r_k$ was
$(\hat{a}_7(r_k), \hat{a}_2(r_k), \hat{a}_3(r_k), \dots) = (-1.46, -1.09, 0.89, 0.88)$,
then $r_i$ and $r_k$ would belong to different clusters, because the
second most distinguishing attribute $a_2$ has a different sign
compared to the respective value of record $r_i$.

Criterion B: $r_i$ and $r_j$ belong to the same cluster if the first m
attributes are identical. For example, considering the previous
example, records $r_i$ and $r_k$ would belong to the same cluster,
even though the sign of the second most distinguishing attribute is
different.
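Criteria A and B can be checked against the example rankings given above with a short Python sketch. The helper names `sign`, `same_cluster_a` and `same_cluster_b`, and the representation of a key as a list of `(attribute number, deviation)` pairs, are assumptions made for illustration.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def same_cluster_a(key_1, key_2, m):
    """Criterion A: the first m attributes match in order and in sign."""
    head_1 = [(attr, sign(dev)) for attr, dev in key_1[:m]]
    head_2 = [(attr, sign(dev)) for attr, dev in key_2[:m]]
    return head_1 == head_2

def same_cluster_b(key_1, key_2, m):
    """Criterion B: the first m attributes match in order; signs are ignored."""
    return [attr for attr, _ in key_1[:m]] == [attr for attr, _ in key_2[:m]]

# Keys taken from the example rankings in the text.
key_i = [(7, -1.17), (2, 0.95), (3, 0.87), (9, 0.56)]
key_j = [(7, -1.46), (2, 1.09), (3, 0.89), (1, 0.88)]
key_k = [(7, -1.46), (2, -1.09), (3, 0.89)]

a_ij = same_cluster_a(key_i, key_j, 3)   # same attributes, same signs
a_ik = same_cluster_a(key_i, key_k, 3)   # sign of attribute 2 differs
b_ik = same_cluster_b(key_i, key_k, 3)   # signs are not compared
```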

Criterion C: $r_i$ and $r_j$ belong to the same cluster if the same
attributes appear on the first m positions with identical signs.
This criterion ignores the order in which the attributes appear.

For example, if m = 3, $r_i$ is as before and the ranking of $r_j$ is
$(\hat{a}_2(r_j), \hat{a}_3(r_j), \hat{a}_7(r_j), \hat{a}_1(r_j), \dots) = (0.72, 0.68, -0.42, 0.37, \dots)$,
the attributes $a_2$, $a_3$ and $a_7$ are identical and share the same signs.

This criterion can be varied by ignoring the signs.
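Criterion C and its sign-ignoring variant can be sketched by comparing sets rather than sequences. The names `same_cluster_c` and `use_sign` are assumptions made for illustration; the keys are taken from the example rankings in the text.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def same_cluster_c(key_1, key_2, m, use_sign=True):
    """Criterion C: the same attributes occupy the first m positions, in any order.

    With use_sign=False the signs of the deviations are ignored as well.
    """
    def head(key):
        # A frozenset discards the order of the first m entries.
        return frozenset((attr, sign(dev)) if use_sign else attr
                         for attr, dev in key[:m])
    return head(key_1) == head(key_2)

key_i = [(7, -1.17), (2, 0.95), (3, 0.87), (9, 0.56)]
key_j = [(2, 0.72), (3, 0.68), (7, -0.42), (1, 0.37)]
key_k = [(7, -1.46), (2, -1.09), (3, 0.89)]

c_ij = same_cluster_c(key_i, key_j, 3)                        # same set, same signs
c_ik = same_cluster_c(key_i, key_k, 3)                        # sign of attribute 2 differs
c_ik_unsigned = same_cluster_c(key_i, key_k, 3, use_sign=False)
```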

The resulting clustering can be further refined by reducing the 
number of the clusters. For example it can be desirable to 
dissolve a cluster having a small size, i.e. having a small 
number of records. This can be done by means of the following 
iterative process: 

1. Rank the clusters by size.

2. Select the smallest cluster.

3. For each record of the cluster, find the one of the larger
clusters that matches most of the significant attributes.
If more than one cluster has to be considered, either
choose the largest of these clusters or use some kind of
distance measure to find the nearest cluster.

4. Repeat until the desired number of clusters has been
reached or until the similarity of records and clusters is too
small.
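One possible reading of this iterative process is sketched below. It assumes that each cluster is identified by the set of its m most significant attributes (as under the sign-ignoring variant of criterion C), that "matches most of the significant attributes" means the largest overlap between these sets, and that ties are resolved in favour of the largest cluster. The names `dissolve_smallest` and `refine` and the sample data are hypothetical, and the second stopping condition (similarity too small) is omitted for brevity.

```python
def dissolve_smallest(clusters):
    """Move the records of the smallest cluster to the best matching other cluster.

    `clusters` maps a frozenset of attribute numbers to a list of record ids.
    Returns a new dict; the input is left unchanged.
    """
    clusters = {sig: list(ids) for sig, ids in clusters.items()}
    if len(clusters) < 2:
        return clusters
    smallest = min(clusters, key=lambda sig: len(clusters[sig]))
    orphans = clusters.pop(smallest)
    for record in orphans:
        # Best match: most shared significant attributes, then largest cluster.
        target = max(clusters,
                     key=lambda sig: (len(sig & smallest), len(clusters[sig])))
        clusters[target].append(record)
    return clusters

def refine(clusters, target_count):
    """Repeat the dissolving step until at most `target_count` clusters remain."""
    while len(clusters) > max(target_count, 1):
        clusters = dissolve_smallest(clusters)
    return clusters

# Hypothetical clustering result with four clusters.
clusters = {
    frozenset({7, 2, 3}): ["r1", "r2", "r3"],
    frozenset({7, 2, 5}): ["r4", "r5"],
    frozenset({1, 4, 6}): ["r7", "r8"],
    frozenset({7, 3, 9}): ["r6"],
}
refined = refine(clusters, 3)
```

Here the single-record cluster shares two significant attributes with the first cluster and is therefore merged into it.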

Figure 2 shows a corresponding data processing system 200. Data
processing system 200 has a database 202 for storing records of
structured data. Each of the records has attribute values
$a_1, \dots, a_k$. Each of the records has an associated data field for storing
a key for that record and a data field for storing a cluster
identifier. Initially the key and cluster data fields are
empty.

Further, data processing system 200 has a module 204 for
calculating characteristic values for each one of the
attributes. The calculation of the characteristic values can be
performed as explained with respect to step 100 of figure 1.

Further, data processing system 200 has module 206 for
calculation of the deviations of the attribute values. This
calculation can be performed in accordance with equation (2)
above.

Module 208 of the data processing system 200 is used for sorting
the attributes of the data records by applying a sorting
criterion to the deviations of the corresponding attribute
values. This way a ranking of the deviations can be obtained
for each record. The sorting can be performed as explained with
respect to step 104 of figure 1.

Further data processing system 200 has modules 210, 212 and 214 
for application of the respective criteria A, B and C. The 
criteria A, B and C are described above with respect to figure 
1. 

Further there is a user interface 216. By means of the user
interface 216 the tabular data contained in database 202 can be
visualised. Further a user can select a subset of the records
contained in the database 202 for performing a clustering
operation. Before the data clustering is performed the user
selects one of the pre-defined clustering criteria A, B or C.
Alternatively the user can define a user specific clustering
criterion.

After the user has selected the set of records of the database
202 on which the data clustering is to be performed and after a
criterion for data clustering has been selected or specified,
the data clustering is initiated.

Firstly, the module 204 is invoked to calculate the
characteristic values of the attributes. Next the module 206 is
invoked to calculate the deviations of the attribute values from
their corresponding characteristic values. By means of module
208 the attributes are sorted to provide a key for each one of
the selected records. Next the module for applying the selected
criterion is invoked, i.e. module 210 for applying criterion A,
module 212 for applying criterion B or module 214 for applying
criterion C. Alternatively a user specified module is invoked
to apply the user specified criterion. As a result of the
application of the selected or specified criterion the selected
records are clustered. Records which are put into the same
cluster are assigned the same cluster identifier which is
entered into the corresponding data field within database 202.
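The sequence of module invocations described above can be sketched end to end as a two-pass procedure. This is a minimal illustration under stated assumptions: records are held in memory as lists of numbers instead of in database 202, the mean is used as characteristic value, criterion A is applied, all means are non-zero, and the function name `cluster_records` and the sample data are hypothetical.

```python
from collections import defaultdict

def cluster_records(records, m=2):
    """Cluster records by the first m (attribute, sign) entries of their keys."""
    n, k = len(records), len(records[0])

    # Pass 1 (module 204): accumulate the sums to obtain the mean of each attribute.
    sums = [0.0] * k
    for record in records:
        for i, value in enumerate(record):
            sums[i] += value
    mu = [s / n for s in sums]

    # Pass 2 (modules 206, 208, 210): deviations, key and cluster per record.
    clusters = defaultdict(list)
    for index, record in enumerate(records):
        devs = [(record[i] - mu[i]) / mu[i] for i in range(k)]
        order = sorted(range(k), key=lambda i: abs(devs[i]), reverse=True)
        # The cluster identifier is the key prefix used by criterion A.
        cluster_id = tuple((i + 1, (devs[i] > 0) - (devs[i] < 0))
                           for i in order[:m])
        clusters[cluster_id].append(index)
    return dict(clusters)

# Four hypothetical records; every attribute has mean 10.
records = [
    [16.0, 8.0, 9.0],
    [15.0, 7.0, 11.0],
    [4.0, 13.0, 11.0],
    [5.0, 12.0, 9.0],
]
result = cluster_records(records, m=2)
```

Because the cluster identifier is a hashable key prefix, grouping needs no pairwise comparison of records, which is consistent with the two passes over the data mentioned in the summary.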

CLAIMS

1. A method of clustering a set of records, each of the records
having attribute values of a set of attributes, the method
comprising the steps of:

for each attribute of the set of attributes: determining a
characteristic value for that attribute based on the
attribute values of that attribute,

for each attribute value: determining a deviation from the
characteristic value of the corresponding attribute,

for each record: sorting of the attributes based on the
deviations to provide a key,

clustering of the records based on the key.

2. The method of claim 1, whereby a mean value of the attribute
values of that attribute is calculated as the characteristic
value.

3. The method of claim 1 or 2, whereby a median value of the
attribute values of that attribute is determined as the
characteristic value.

4. The method of claim 1, 2 or 3, whereby the deviation is
calculated based on a difference between that attribute value
and the characteristic value of the corresponding attribute.

5. The method of any one of the preceding claims 1 to 4, whereby
the deviation is calculated by calculating the difference
between that attribute value and the characteristic value of
the corresponding attribute, and by dividing the difference
by the characteristic value of the corresponding attribute.

6. The method of any one of the preceding claims 1 to 5, whereby
the absolute values of the deviations of the attributes are
used as a sorting criterion.

7. The method of any one of the preceding claims 1 to 6, whereby
a first one of the set of records having a first key and a
second one of the set of records having a second key are put
into the same cluster, if the first and the second keys have
identical sub-sequences of a first length.

8. The method of any one of the preceding claims 1 to 7, whereby a first
one of the records of the set of records having a first key
and a second record of the set of records having a second key
are put into the same cluster, if the first and second keys
contain identical sub-sequences of absolute values of the
deviations.

9. The method of any one of the preceding claims 1 to 8, whereby
a first record of the set of records having a first key and a
second record of the set of records having a second key are
put into the same cluster, if the first key has a first
sub-sequence and the second key has a second sub-sequence, the
first and second sub-sequences comprising the same set of
attributes.

10. The method of any one of the preceding claims 1 to 9, whereby
a first record of the set of records having a first key and a
second record of the set of records having a second key are
put into the same cluster, if the first key has a first
sub-sequence and the second key has a second sub-sequence and if
the first and second sub-sequences comprise the same
attributes irrespective of a sign of the deviations of the
attributes.

11. The method of any one of the preceding claims 1 to 10,
further comprising:

identifying a cluster having the smallest number of
records,

for each record of the identified cluster: searching
another cluster having records with best matching keys.

12. The method of claim 11, whereby the length of the sub- 
sequences is reduced for finding a best match. 

13. The method of claim 11 or 12, whereby a distance measure is
used to find another cluster for a record of the identified 
cluster. 

14. The method of claim 13, whereby the distance measure is
the Euclidean distance.

15. The method of any one of claims 11 to 14, whereby gravitation
is used for reducing the number of clusters.

16. A computer program product, such as a digital storage medium,
comprising computer program means for performing a method in
accordance with any one of the preceding claims 1 to 15.



17. A data processing system comprising processing means for
performing a method in accordance with any one of the
preceding claims 1 to 15.

The invention relates to a method of clustering a set of records,
each of the records having attribute values of a set of
attributes, the method comprising the steps of:

for each attribute of the set of attributes: determining a
characteristic value for that attribute based on the
attribute values of that attribute,

for each attribute value: determining a deviation from the
characteristic value of the corresponding attribute,

for each record: sorting of the attributes based on the
deviations to provide a key,

clustering of the records based on the key.



(Figure 1)

Step 100: Determine projections of attribute values of records to
provide a characteristic value for each attribute.

Step 102: For each record, determine the deviation of each attribute
value from the corresponding characteristic value.

Step 104: For each record, sort the deviations of the attribute
values of that record to provide a key.